Cost of Living Throughout America
Final Project
Data Science 1 with R (STAT 301-1)
Introduction
This report explores the EPI Family Budget dataset to examine how the cost of living varies across geographical regions of the United States, and how family budgets, income disparities, and metro classifications relate to one another.
I was motivated to perform this analysis because I think it is both interesting and beneficial to understand how the cost of living differs, not only from state to state but also by family size and county location. It is also an opportunity to learn how income levels and expenses differ between counties and states, and to start to understand why these differences are present. Finally, by also looking at the minimum wage in each state, I think this analysis can bring attention to much of the inequity present in the United States of America.
My initial curiosities while conducting early research on this data were as follows. First, I was interested in how the median family income of each county correlates with total annual expenses at the state, regional, and metro levels. Second, I wanted to see whether any patterns or trends in budget allocation stand out at the county, metro, state, or regional level. Third, moving from annual to monthly numbers, I was interested in whether the two calculations differ and, if so, how that changes the overall cost of living in the areas of interest stated above. With these starting curiosities, I was able to conduct a full exploration of the data that focused on comparisons and correlations, to bring insight into how different categories of expenses are allocated relative to total expenses and the cost of living in different geographical locations.
As stated above, I will be using the Economic Policy Institute’s data on family budgets, which also describes the cost of living in each county in America. This dataset provides the average costs of different aspects of life for each county in America, both annually and monthly, and divides these averages further by family type, ranging from 1 parent and no children to 2 parents and 4 children. To enhance the dataset for my own research, I added the geographical region in which each county is located, based on its state (South, Midwest, Northeast, or West), and the minimum wage for each state and county pair. The former was sourced from the U.S. Census Bureau website, and the latter from Paycom.com. See References for citations of the Economic Policy Institute’s dataset and of the additional sources I used for my region and minimum wage variables.
In terms of the layout of my report, I will first provide an overview and quality check of my data, describing how the data looks and how I formatted it for my own research. I will then begin my main explorations, where I discuss the early univariate and bivariate analyses that I conducted, and then turn to the three main questions that shaped the flow of my exploratory data analysis. Lastly, I will conclude with a summary of the main insights that I found throughout my research and discuss directions in which these analyses could be taken further.
Data Overview & Quality
In its original state, the EPI Family Budget dataset consisted of 27 variables and 31,430 observations, with twenty-three numerical variables and four categorical variables. I added my own minimum wage and region variables and changed the variable type of the family type and metropolitan status variables. I also had to decide how to handle the missingness in the dataset. After further investigation, I realized that all of the missing values corresponded to one specific county and its different family types. I decided it would be best to remove that county's observations entirely, as I felt that leaving them in would cause more problems for my analysis than taking them out. The updated dataset includes 29 variables and 31,420 observations, with six categorical variables and twenty-three numerical variables. After these changes, the dataset is of high quality: it has no remaining missing values and is straightforward to use in my analyses.
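The cleaning steps described above can be sketched in base R as follows. This is only an illustration on a toy data frame: the column names (`state`, `county`, `family`, `is_metro`, `total_annual`), the placeholder state and county labels, and the `region_lookup` vector are all assumptions, not the names used in the actual EPI file.

```r
# Toy stand-in for the EPI data; every name and value here is hypothetical.
budget <- data.frame(
  state = c("S1", "S1", "S2", "S2", "S3", "S3"),
  county = c("A", "A", "B", "B", "C", "C"),
  family = c("1p0c", "2p4c", "1p0c", "2p4c", "1p0c", "2p4c"),
  is_metro = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  total_annual = c(40000, 110000, NA, NA, 55000, 140000),
  stringsAsFactors = FALSE
)

# Add a region variable from a (hypothetical) state-to-region lookup.
region_lookup <- c(S1 = "Midwest", S2 = "South", S3 = "Northeast")
budget$region <- factor(unname(region_lookup[budget$state]))

# Convert family type and metro status to categorical variables.
budget$family <- factor(budget$family)
budget$metro_status <- factor(ifelse(budget$is_metro, "metro", "nonmetro"))

# Find which state/county pairs contain missing values ...
has_na <- !complete.cases(budget)
bad <- unique(budget[has_na, c("state", "county")])

# ... and drop every observation belonging to those counties.
bad_key <- paste(bad$state, bad$county)
cleaned <- budget[!(paste(budget$state, budget$county) %in% bad_key), ]
```

Checking `unique()` on the rows with missing values is what shows whether the missingness is confined to a single county before any rows are dropped.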
Explorations
This section is the heart of my analysis: a comprehensive exploration of the EPI Family Budget dataset. Here I work through the layers of the data to examine how the cost of living varies across multiple geographical facets of the United States.
Before diving into my main questions, I will briefly revisit the insights from my preliminary analyses of individual and paired variables, which gave me a foundational understanding of the dataset.
I would also like to preface this section by noting that I present only the most pivotal figures, those that capture the core analyses driving my insights and narrative. While these highlighted figures capture the essence of my findings, I acknowledge that a more complete view may be desired. For the full set of visuals generated during my explorations, including supplementary analyses and detailed breakdowns, please refer to Appendix I, which collects all figures not shown in this section.
Univariate Analysis
For my univariate analysis, I looked at both the categorical and numerical variables, and found the most interesting statistics and figures among the numerical expense categories.
Before looking into the categories of expenses, however, I believe it is important to highlight the difference between the number of nonmetro areas and metro areas in the dataset, to gauge whether this geographical difference will have any impact on how I view and analyze my later findings.
In Figure 1 above we see that far more counties are in nonmetropolitan areas than in metropolitan areas. I am interested to see how this affects categories such as transportation and healthcare. Being farther from a metro area can mean more travel to obtain necessities, and families who live farther from hospitals, or who have fewer hospitals nearby than families in highly urban areas, might go to the hospital less often. I was excited to look further into these relationships. From this we can also compare metropolitan areas in the South with those in the North, and likewise nonmetropolitan areas in each region, to gauge whether geographical region matters more than metro status or vice versa.
Looking at the categories of expenses, I originally wanted to focus my research mostly on the total, transportation, healthcare, and housing costs, each at both the annual and monthly levels. Below I provide a brief explanation of the distribution of each expense at the national level. A breakdown of the other categories of expenses can be found in Appendix I - Univariate Analysis.
In Figure 2 we see that the distribution of annual healthcare expenses has an extremely large spread in comparison to the other annual variables. The plot shows a roughly symmetric multimodal shape, with the average annual cost of healthcare around $12000. Outside of this average value, there are still smaller but notable subgroups with average healthcare costs around $6000 and $20000. Some early potential explanations for this distribution are family size and location, as well as how the minimum wage and median family income relate to these higher healthcare expenses. Turning to the distribution of annual housing costs, we see a right-skewed distribution, as most families tend to spend around $12000 on housing annually. I am surprised that there isn’t a larger spread, as I know that housing in cities tends to be more expensive than housing in nonmetropolitan areas, and that different regions have different housing market demands. The distribution of transportation expenses has a unimodal right-skewed shape, with most families spending on average $13000 a year on transportation. I am not surprised by the lack of spread in this distribution, as most families regardless of location spend a lot of money on car expenses each year; however, I want to see whether metropolitan status makes any difference to the shape of the distribution. Lastly, among the annual variables, the distribution of annual total costs at the national level has a bimodal and slightly right-skewed shape, with most families spending on average around $60000 a year. Within this plot of total annual expenses, we expect and see that, despite this average value, there is a lot of spread and variation that we must account for, relating to state and regional differences.
Turning our attention to the distributions of the same variables at the monthly level, we see in Figure 3, as expected, trends similar to those I pointed out above. For example, the distribution of monthly healthcare costs shows a lot of variability, with a multimodal, slightly right-skewed shape and an average cost around $1200 a month. For monthly housing expenses, the plot shows that families spend about $900 on housing on average, with some special cases of families spending over $2000 a month, giving a unimodal right-skewed shape. For transportation, we again see a right-skewed unimodal shape, with families spending on average $1200 each month. Lastly, total monthly expenses also show a fairly large spread across family types at the national level, with an average around $7000 and a unimodal, right-skewed shape. The distributions in each of these plots are as expected, both in comparison to the annual expense distributions and in terms of where the spread in each distribution occurs.
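One way to check whether the annual and monthly figures carry different information is to test whether each annual value is simply twelve times its monthly counterpart. The sketch below does this on made-up numbers; the column names `housing_monthly` and `housing_annual` and the one-dollar tolerance are assumptions chosen for illustration.

```r
# Hypothetical monthly and annual housing costs for three county/family rows.
expenses <- data.frame(
  housing_monthly = c(900, 1250.50, 2100),
  housing_annual = c(10800, 15006, 25200)
)

# Difference between the reported annual value and 12 x the monthly value.
expenses$gap <- expenses$housing_annual - 12 * expenses$housing_monthly

# Allow a small tolerance for rounding in the source data.
expenses$consistent <- abs(expenses$gap) < 1

# TRUE when every row's annual figure is just the monthly figure scaled by 12.
all_consistent <- all(expenses$consistent)
```

If every row passes, the monthly and annual distributions have the same shape on different scales, which would match the similar trends described above.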
Bivariate Analysis
For my bivariate analysis, I focused on gaining insights in three main areas: creating a correlation matrix; looking specifically at the relationship between median family income and total annual expenses (the cost of living) at the national level; and bringing the univariate analyses and insights down to more micro levels, namely the regional, family type, and metro classification levels.
First, I will discuss the main observations from my correlation matrix, which helped shape the curiosities behind my main explorations.
Figure 4 is a correlation matrix showing the relationships between the different categories of expenses, as well as how they relate to median family income and to a county's in-state income ranking. Looking at the in-state income ranking column, we expect a negative correlation with the different expenses, because as expenses go up a county's rank number goes down, for example moving from number 10 to number 1 in the ranking. Interestingly, however, healthcare has a slightly positive but nearly nonexistent relationship with the ranking. This means either that healthcare expenses do not play a large part in the ranking, probably because they are fairly consistent nationally, or that for some reason an increase in a county's healthcare expenses moves its rank number up, for example from 1st to 10th. This is very interesting and I would love to look more into this relationship. Looking at all of the categories of expenses, we see that they correlate strongly with each other, which makes sense because an increase in one usually comes with an increase in another as the cost of living goes up. There are probably also geographical reasons for this, and some regions may have stronger correlations between their categories of expenses than others. Lastly, I would like to highlight the relationship between minimum wage and healthcare (both annually and monthly), which is slightly negative but almost nonexistent. I would not expect healthcare prices to decrease as the minimum wage increases, but rather the opposite, so one possible reading is that healthcare prices do not move with the minimum wage at all.
I would love to look further into this relationship as well, to see whether more or less of one’s total expenses goes towards healthcare as the minimum wage increases.
Another main area of focus that helped build the curiosities for my main exploration was the relationship between median family income and total annual expenses, or the cost of living, for each county in the United States of America.
Figure 5 gives a good visual representation of the relationship between median family income and total annual expenses at the broadest level. Within this plot, I found it intriguing, though almost expected, how spread out the data is, both in how much families’ expenses are per year and in the disparities between income levels in the United States. This figure also brings attention to how most families’ incomes are less than their total expenses per year. This is already known to be a large problem in the United States as a whole; however, I hope that my further analysis can show whether these disparities occur more in one geographical facet than another, to help bring insight into why they exist.
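A correlation matrix of the kind discussed above can be computed with base R's `cor()`. The following is a minimal sketch on invented numbers; the four column names are assumptions and the values are not taken from the EPI data, so the resulting correlations say nothing about the real dataset.

```r
# Toy numeric data; names and values are hypothetical.
toy <- data.frame(
  median_income = c(52000, 61000, 75000, 48000, 90000, 67000),
  housing_annual = c(9000, 11000, 15000, 8500, 21000, 12500),
  healthcare_annual = c(12000, 6500, 19000, 12500, 7000, 13000),
  total_annual = c(58000, 60000, 82000, 55000, 95000, 70000)
)

# Pearson correlations between every pair of columns, rounded for display.
cor_mat <- round(cor(toy, use = "complete.obs"), 2)

# Pull out one row to inspect how a single variable relates to the rest.
healthcare_row <- cor_mat["healthcare_annual", ]

# Flag near-zero relationships using an arbitrary cutoff of 0.1.
weak_with_healthcare <- names(healthcare_row)[abs(healthcare_row) < 0.1]
```

The 0.1 cutoff for a "nearly nonexistent" relationship is only a convenience for this sketch; a different threshold may suit the real data better.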
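The share of observations where income falls below expenses can be quantified directly rather than read off the plot. The sketch below assumes hypothetical columns `region`, `median_income`, and `total_annual`, with made-up values.

```r
# Toy data: one row per county/family-type combination (values invented).
toy <- data.frame(
  region = c("South", "South", "West", "West"),
  median_income = c(50000, 40000, 45000, 90000),
  total_annual = c(60000, 65000, 58000, 88000)
)

# TRUE where median family income does not cover total annual expenses.
toy$income_below_expenses <- toy$median_income < toy$total_annual

# Overall share of rows with a shortfall (the mean of a logical is a proportion).
share_overall <- mean(toy$income_below_expenses)

# The same share computed separately for each region.
share_by_region <- tapply(toy$income_below_expenses, toy$region, mean)
```

Grouping by region here is one possible choice; metro status or state could be substituted in the `tapply()` call to compare other geographical facets.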
Lastly, in my bivariate analysis I wished to see how each of the categories of expenses, at both the annual and monthly levels, compares at more micro levels, looking at regional, metro, and family type differences. In this section, I will only discuss the insights I gained on the distributions of total annual and monthly expenses at these three levels, as they fit best with my further analyses and curiosities. However, I have also included in Appendix I - Bivariate Analysis visualizations and short descriptions of the distributions of the three other main variables that I previously discussed (transportation, housing, healthcare) as additional information.
Observing each set of plots, we see that the only real difference currently appears when we look at the different family types, as the distributions of total expenses (both annual and monthly) shift further and further to the right, that is, they increase. This makes sense, as larger families tend to have more expenses. Otherwise, at the geographical and metro status levels, we see little to no difference between total annual and monthly expenses, except that the distribution has a larger spread and extends further to the right for families that live in a metro area, the Northeast, or the West. This is important: although it is commonly known that living in one of these three geographical facets is usually more expensive, it draws attention to how this then impacts other factors, such as median family income and housing, and to whether the allocation of expenses is the same for each geographical facet. Combining these different levels may therefore help produce more meaningful insights.
Conclusion statement on both the bivariate and univariate analyses: At this point I want to look at the main factors causing the large spread in the distributions of healthcare at both the monthly and annual levels, and to see how total expenses at the monthly and annual levels break down by region or state.
Now, with this groundwork established, we can turn our attention to the core questions that I formed to drive my exploration:
1. How does the cost of living vary across different geographical facets and metro classifications?
For myself: first look at the largest level (national), then regional, then compare the states within every region. Follow-up question: Are there observable trends in transportation costs based on the availability and accessibility of public transportation in metro areas?
My first inquiry delves into how the geography and metro classifications of different areas in America shape variations in the cost of living. Through this question, I aim to unravel the economic nuances that define household budgets.
To start my investigation for this question
Say an intro sentence here:
2. Are there discernible patterns or trends in budget allocation that stand out, considering family types and monthly expenses?
Building on the foundation of my previous analyses, I now shift focus more heavily to the allocation of expenses. Exploring patterns and trends in budget allocation, I aim to identify distinctive markers that stand out across the different family types and in the relationship between monthly and annual expenses. In particular, I look at how family size impacts budget allocation and whether there are specific categories where larger families allocate a significantly higher percentage of their budget.
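Budget allocation can be expressed as each category's share of total expenses and then averaged within each family type. The sketch below shows one way to do this in base R; the column names and family-type labels are assumptions, the numbers are invented, and only two categories are included to keep it short.

```r
# Toy monthly budgets for two hypothetical family types.
toy <- data.frame(
  family = c("1p0c", "1p0c", "2p4c", "2p4c"),
  housing_monthly = c(800, 1000, 1200, 1400),
  healthcare_monthly = c(400, 500, 1500, 1700),
  total_monthly = c(3000, 3500, 9000, 10000)
)

# Share of the total budget going to each category, row by row.
categories <- c("housing_monthly", "healthcare_monthly")
shares <- toy[categories] / toy$total_monthly
names(shares) <- sub("_monthly", "_share", categories)

# Average share per family type: one row per family type, one column per category.
mean_shares <- aggregate(shares, by = list(family = toy$family), FUN = mean)
```

A table shaped like `mean_shares` (family types as rows, categories as columns) is also a convenient starting point for a heatmap of allocation percentages.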
For me: I will work on the heatmap and see how it looks, again at the national, regional, and state levels. I also want to pick two states, one the best in a category and one the worst, and then look at how the counties within those two states compare. Follow-up question: What is the impact of the number of working adults in a family on budget allocation? Do dual-income families allocate their budgets differently compared to single-income families?
intro statement:
3. How does the minimum wage in different states correlate with the affordability of living, particularly in terms of housing, healthcare, and other essential expenses for various family types and regions?
Lastly, I turned my focus to a critical economic variable: minimum wage. By looking at how the minimum wage correlates with the cost of living, I aimed to shed light on how variations in minimum wage impact crucial aspects of family budgets, from housing and healthcare to other essential expenses.
Thus through answering these three essential questions, I was able to gain interesting insights on the following…..
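One simple affordability measure is the ratio of total annual expenses to what a single full-time worker earns at the minimum wage. The sketch below assumes 40 hours a week for 52 weeks, which is a simplifying assumption and not something stated in the data; the state labels, column names, and expense values are hypothetical.

```r
# Toy data: two hypothetical states with different minimum wages.
toy <- data.frame(
  state = c("S1", "S2"),
  min_wage = c(7.25, 15.00),
  total_annual = c(45000, 62000)
)

# Assumed full-time schedule: 40 hours per week, 52 weeks per year.
hours_per_year <- 40 * 52

# Annual pre-tax earnings of one full-time minimum wage worker.
toy$min_wage_income <- toy$min_wage * hours_per_year

# How many such incomes would be needed to cover total annual expenses.
toy$expense_to_wage_ratio <- toy$total_annual / toy$min_wage_income
```

A ratio above 1 means one full-time minimum wage income does not cover the budget, and comparing the ratio across states or family types puts different minimum wages on a common scale.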
Conclusions
State conclusions or insights. Were you surprised by things you found or were they as expected? Why? This is a great place for future work, new research questions, and next steps.
References
Economic Policy Institute (2022, March). Family Budget Map. https://www.epi.org/resources/budget/budget-map/
U.S. Census Bureau (2021, October 8). Census Regions and Divisions of the United States. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
Paycom (2023, October 2). Your 2023 Guide to Every State’s Minimum Wage. https://www.paycom.com/resources/blog/minimum-wage-rate-by-state/
Appendix I: Extra Explorations
Univariate Analysis
talk the talk
talk the talk
Bivariate Analysis
For all of the figures below, just as in my bivariate analysis, we see that the only real difference currently appears when we look at the different family types, with the distributions of each expense category (both annual and monthly) shifting further and further to the right, that is, increasing. This makes sense, as larger families tend to have more expenses in all of the different categories. Otherwise, at the geographical and metro status levels, we see little to no difference between the annual and monthly expenses in each category, except that the distribution has a larger spread and extends further to the right for families that live in a metro area, the Northeast, or the West.